Introduction to Open Data Science. Happy to learn to connect Git and R even though the content of data analysis is familiar to me. I’ve used R MarkDown before and found it useful. In addition I’ve used Git before on Tietokantojen Perusteet course. However, it’s been a long time so recalling things is needed. Here’s the link to my IODS GitHub repository.
Today is the following day.
## [1] "Mon Nov 16 16:15:13 2020"
I wanted to remind me what I’ve done last time with R MarkDown. I found a nice data on exchange and forward rates. I make a table and a graph of Sterling/EUR exchange rate. I uploaded forward2.dat to git repository and I call my data set from there in order to plot the rate. I hide the R code to make the outcome more readable.
## Warning: `as_data_frame()` is deprecated as of tibble 2.0.0.
## Please use `as_tibble()` instead.
## The signature and semantics have changed, see `?as_tibble`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
| EXUSBP | EXUSEUR | EXEURBP | F1USBP | F1USEUR | F1EURBP | F3USBP | F3USEUR | F3EURBP | |
|---|---|---|---|---|---|---|---|---|---|
| Min. :1.073 | Min. :0.5827 | Min. :0.3567 | Min. :1.067 | Min. :0.5873 | Min. :0.3588 | Min. :1.061 | Min. :0.5941 | Min. :0.3624 | |
| 1st Qu.:1.507 | 1st Qu.:0.8876 | 1st Qu.:0.4855 | 1st Qu.:1.504 | 1st Qu.:0.8885 | 1st Qu.:0.4845 | 1st Qu.:1.501 | 1st Qu.:0.8926 | 1st Qu.:0.4841 | |
| Median :1.617 | Median :1.0738 | Median :0.5598 | Median :1.616 | Median :1.0774 | Median :0.5589 | Median :1.609 | Median :1.0855 | Median :0.5588 | |
| Mean :1.665 | Mean :1.0416 | Mean :0.6213 | Mean :1.663 | Mean :1.0447 | Mean :0.6203 | Mean :1.658 | Mean :1.0503 | Mean :0.6184 | |
| 3rd Qu.:1.756 | 3rd Qu.:1.1741 | 3rd Qu.:0.7107 | 3rd Qu.:1.753 | 3rd Qu.:1.1778 | 3rd Qu.:0.7091 | 3rd Qu.:1.748 | 3rd Qu.:1.1819 | 3rd Qu.:0.7051 | |
| Max. :2.443 | Max. :1.4222 | Max. :1.6002 | Max. :2.441 | Max. :1.4240 | Max. :1.5954 | Max. :2.433 | Max. :1.4278 | Max. :1.5863 |
I’m happy to see that R MarkDown is linked to LaTeX syntax! Here’s a maximization problem from my current research on the optimal mechanism design with enforcement: \[\begin{align} \max_{r(\cdot),t(\cdot),m(\cdot)} \mathbb{E} \left[ t(\theta) - K m(\theta) \right]\label{OBJ} \tag{MAX} \end{align}\] subject to \[\begin{align} &t(\theta) = \theta r(\theta) - V(\underline{\theta}) - \int_{\underline{\theta}}^{\theta} \mathcal{I}(s|s)ds \label{TAX} \tag{TAX} \\ &\mathcal{I}(\theta|\theta) \qquad \text{is nondecreasing} \label{IC} \tag{IC} \\ &\mathcal{I}(\theta|\theta) \geq 0 \label{IR} \tag{IR} \end{align}\] for all \(\theta \in \Theta\) with \(\mathcal{I}(\theta|\theta):=r(\theta) - m(\theta) \varphi\).
Today is
## [1] "Mon Nov 16 16:15:14 2020"
In this week we learn how to create data and analyse it. We read the data from the given URL. The meta descriptions can be found from here.
The dataset queries the relationship between learning approaches and students’ achievements in an introductory statistics course in Finland. The original data have 183 responders (observations) for 97 questions (variables). For our purposes, we subset the dataset to contain variables gender, age, attitude, deep, stra, surf and points. The variables are either integers or numerics. The meta-file gives the following description for each variable:
Next we omit the observations where the exam points variable is zero. After that we scale variables by the mean – that is, we divide each combination variable by its number of questions asked. So we have 166 observations for 7 variables left.
Let us next read the data we created earlier in the data wranling part:
# Read the data
learning2014 <- read.csv("/Users/teemupekkarinen/IODS-project/data/learning2014",row.names = 1)
To summarize our dataset’s content we run
# Structure
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ Gender : int 2 1 2 1 1 2 1 2 1 2 ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ Attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ Deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ Stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ Surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
# Dimension
dim(learning2014)
## [1] 166 7
That is, we have the information about gender, age, attitude, and three different learning approaches (deep, strategic, and surface) of 166 responders. All of the variables are numeric and the combination variables (attitude, deep, stra, and surf) are means. For more detailed description we run:
# Summary
summary(learning2014)
## Gender Age Attitude Deep
## Min. :1.000 Min. :17.00 Min. :1.400 Min. :1.583
## 1st Qu.:1.000 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333
## Median :2.000 Median :22.00 Median :3.200 Median :3.667
## Mean :1.663 Mean :25.51 Mean :3.143 Mean :3.680
## 3rd Qu.:2.000 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083
## Max. :2.000 Max. :55.00 Max. :5.000 Max. :4.917
## Stra Surf Points
## Min. :1.250 Min. :1.583 Min. : 7.00
## 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
## Median :3.188 Median :2.833 Median :23.00
## Mean :3.121 Mean :2.787 Mean :22.72
## 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :5.000 Max. :4.333 Max. :33.00
We observe that there are almost twice more female responders than male responders: 56 men and 110 women. The average age of the responders is 26 and the youngest person is 17 years old and the oldest 55 years old. The age distribution is wide but skewed towards young people. As summarize we can say than a typical responder is a young female.
# Histrograms
Gender <- learning2014$Gender
hist(Gender,breaks=2)
sum(Gender==1)
## [1] 56
Age <- learning2014$Age
hist(Age)
mean(Age)
## [1] 25.51205
var(Age)
## [1] 60.31198
The average exam points is 28 with maximum 33 and minimum 7. The exam results are more or less normally distributed among all participants.
# Histrograms
Points <- learning2014$Points
hist(Points)
However, when we compare the points between men and female, we observe that the mean of exam points for men is slightly greater than for women. There are even more men with score 30 or 31 than women.
# Package
library(ggplot2)
# Histrograms
PointsMale <- learning2014$Points[Gender==1]
mean(PointsMale)
## [1] 23.48214
PointsFemale <- learning2014$Points[Gender==2]
mean(PointsFemale)
## [1] 22.32727
data <- data.frame(
type = c( rep("Male", 56), rep("Female", 110) ),
value = c(PointsMale, PointsFemale)
)
ggplot(data, aes(x=value, fill=type)) +
geom_histogram(binwidth=.5, alpha=.5, position="identity")
The mean of attitude, 3.14, is almost neutral 3. That is, on average responders have a bit positive attitude towards statistics. Also attitude is nearly normally distributed. Male responders have more positive attitude than female.
# Histrograms
Attitude <- learning2014$Attitude
hist(Attitude)
AttitudeMale <- learning2014$Attitude[Gender==1]
mean(PointsMale)
## [1] 23.48214
AttitudeFemale <- learning2014$Attitude[Gender==2]
mean(PointsFemale)
## [1] 22.32727
data <- data.frame(
type = c( rep("Male", 56), rep("Female", 110) ),
value = c(AttitudeMale, AttitudeFemale)
)
ggplot(data, aes(x=value, fill=type)) +
geom_histogram(binwidth=.5, alpha=.5, position="identity")
The minority prefers surface learning approach to learning in the statistics course, whereas the majority prefers the methods of deep learning. Moreover, the greatest variation in learning methods is in strategic learning. Deep and surface learning is almost identically distributed expect deep learning has a greater mean.
# Package
library(ggplot2)
# Histrograms
Deep <- learning2014$Deep
Stra <- learning2014$Stra
Surf <- learning2014$Surf
data <- data.frame(
type = c( rep("Deep", 166), rep("Stra", 166), rep("Surf", 166) ),
value = c(Deep, Stra, Surf)
)
ggplot(data, aes(x=value, fill=type)) +
geom_histogram(binwidth=.5, alpha=.5, position="identity")
There is no much difference between genders in learning methods; the distributions between male and female responders are almost indentically distributed.
# Histrograms
# Strategic Learning
StraMale <- learning2014$Stra[Gender==1]
StraFemale <- learning2014$Stra[Gender==2]
data <- data.frame(
type = c( rep("Male", 56), rep("Female", 110) ),
value = c(StraMale, StraFemale)
)
ggplot(data, aes(x=value, fill=type)) +
geom_histogram(binwidth=.5, alpha=.5, position="identity")
# Deep Learning
DeepMale <- learning2014$Deep[Gender==1]
DeepFemale <- learning2014$Deep[Gender==2]
data <- data.frame(
type = c( rep("Male", 56), rep("Female", 110) ),
value = c(DeepMale, DeepFemale)
)
ggplot(data, aes(x=value, fill=type)) +
geom_histogram(binwidth=.5, alpha=.5, position="identity")
# Surface Learning
SurfMale <- learning2014$Surf[Gender==1]
SurfFemale <- learning2014$Surf[Gender==2]
data <- data.frame(
type = c( rep("Male", 56), rep("Female", 110) ),
value = c(SurfMale, SurfFemale)
)
ggplot(data, aes(x=value, fill=type)) +
geom_histogram(binwidth=.5, alpha=.5, position="identity")
Next we run regressions where exam points is our dependent variable. Let us first look at the correlations between exam points and the explanatory variables.
plot(Attitude, Points, main = "Regression",
xlab = "Attitude", ylab = "Points") + abline(lm(Points~Attitude), col = "blue")
## integer(0)
plot(Gender, Points, main = "Regression",
xlab = "Gender", ylab = "Points") + abline(lm(Points~Gender), col = "blue")
## integer(0)
plot(Age, Points, main = "Regression",
xlab = "Age", ylab = "Points") + abline(lm(Points~Age), col = "blue")
## integer(0)
plot(Deep, Points, main = "Regression",
xlab = "Deep Learning", ylab = "Points") + abline(lm(Points~Deep), col = "blue")
## integer(0)
plot(Stra, Points, main = "Regression",
xlab = "Strategic Learning", ylab = "Points") + abline(lm(Points~Stra), col = "blue")
## integer(0)
plot(Surf, Points, main = "Regression",
xlab = "Surface Learning", ylab = "Points") + abline(lm(Points~Surf), col = "blue")
## integer(0)
cov(learning2014)
## Gender Age Attitude Deep Stra Surf
## Gender 0.22489960 -0.4383352 -0.10184739 -0.01526713 0.05326762 0.02826457
## Age -0.43833516 60.3119752 0.12584520 0.10792260 0.61406079 -0.58075332
## Attitude -0.10184739 0.1258452 0.53276561 0.04458987 0.03478322 -0.06776013
## Deep -0.01526713 0.1079226 0.04458987 0.30706766 0.04127419 -0.09489017
## Stra 0.05326762 0.6140608 0.03478322 0.04127419 0.59572437 -0.06570525
## Surf 0.02826457 -0.5807533 -0.06776013 -0.09489017 -0.06570525 0.27967222
## Points -0.25972983 -4.2662651 1.87824388 -0.03315078 0.66483662 -0.45002434
## Points
## Gender -0.25972983
## Age -4.26626506
## Attitude 1.87824388
## Deep -0.03315078
## Stra 0.66483662
## Surf -0.45002434
## Points 34.74965316
From the single regressions and the covariance matrix we observe that there is negative correlation between points and gender, age, deep learning and surface learning. That is, what we already observed, men were doing better in the course than women. Moreover, it seems to be so that younger students got better exam results than the older participants. Strategic learning was the only method that has positive correlation between exam results. There is a big positive correaltion between attitude and points.
Next we choose first gender, attitude and age to regress exam points.
reg <- lm(Points~Gender+Attitude+Age)
summary(reg)
##
## Call:
## lm(formula = Points ~ Gender + Attitude + Age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.4590 -3.3221 0.2186 4.0247 10.4632
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.76802 3.17695 4.019 8.93e-05 ***
## Gender 0.33054 0.91934 0.360 0.720
## Attitude 3.60657 0.59322 6.080 8.34e-09 ***
## Age -0.07586 0.05367 -1.414 0.159
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.315 on 162 degrees of freedom
## Multiple R-squared: 0.2018, Adjusted R-squared: 0.187
## F-statistic: 13.65 on 3 and 162 DF, p-value: 5.536e-08
From here we observe that attitude is the only significant explanatory variable by itself. However, by the F-test all three variables are together significant explanatories.
Nevertheless, following the instructions we drop off gender variable and run the regression again.
# Regression
reg <- lm(Points~Attitude+Age)
summary(reg)
##
## Call:
## lm(formula = Points ~ Attitude + Age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.3354 -3.3095 0.2625 4.0005 10.4911
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.57244 2.24943 6.034 1.04e-08 ***
## Attitude 3.54392 0.56553 6.267 3.17e-09 ***
## Age -0.07813 0.05315 -1.470 0.144
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.301 on 163 degrees of freedom
## Multiple R-squared: 0.2011, Adjusted R-squared: 0.1913
## F-statistic: 20.52 on 2 and 163 DF, p-value: 1.125e-08
Attitude remains to be the strongest predictor. Age is still unsignifant even though together with attitude it is significant (F-test). It turns out, after trying all possible combinations of explanatory variables, only regressing with attitude we get 5 % significant level of all regressors. We thus lastly regress only with attitude and get the following.
reg <- lm(Points~Attitude)
summary(reg)
##
## Call:
## lm(formula = Points ~ Attitude)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## Attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
From the summary we observe that the intercept term is positive and significant: 11.63 with standard error of 1.83. This is the test score if attitude was “zero”. Each unit jump in attitution increases (not causal) exam results by 3.5 points. Hence, with maximum attitude our model predicts that the test result is \(11.63 + 5 * 3.5 = 29\) points.
We observe also that R-squared decreases a little everytime we omit an explanatory variable from the regression. R-squared is the measure that represents the proportion of the variance for the dependent variable that is explained by explanatory variables. The adjusted R-squared is a modified version of R-squared that has been adjusted for the number of predictors in the model.
Lastly we plot the regression diagnostic plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage.
plot(reg)
Residuals vs Fitted values figure is a simple scatterplot between residuals and predicted values. It should look more or less random, which is roughly the case in our analysis. The red line is not exactly zero all the time, which is a sign of that the residuals may have a little positive predictive power (omitted variable bias).
The Normal QQ plot helps us to assess whether the residuals are roughly normally distributed. In our cases the tails scarper from the diagonal which may imply that we have some problems with assumption of normally distributed error terms. Probably some t-distribution could be better.
The last plot for the residuals vs leverage (Cook’s distance) tells us which points have the greatest influence on the regression (leverage points). It turns out that points 35, 56, and 71 have the strongest effects on the dependent variable.
For the causal inference we need to have very strong assumptions.
Before going to the assumptions, we introduce the vector and matrix notation. Define K-dimensional (column) vectors \(\boldsymbol{x}_i\) and \(\boldsymbol{\beta}\) as
\[ \underset{(K \times 1)}{\boldsymbol{x}_i} = \begin{pmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{iK} \end{pmatrix}, \underset{(K \times 1)}{\boldsymbol{\beta}} = \begin{pmatrix} \beta_{1} \\ \beta_{2} \\ \vdots \\ \beta_{K} \end{pmatrix}.\] Also define \[ \underset{(n \times 1)}{\boldsymbol{y}} = \begin{pmatrix} y_{1} \\ \vdots \\ y_{n} \end{pmatrix}, \underset{(n \times 1)}{\boldsymbol{\varepsilon}} = \begin{pmatrix} \varepsilon_{1} \\ \vdots \\ \varepsilon_{n} \end{pmatrix}, \underset{(n \times K)}{\boldsymbol{X}} = \begin{pmatrix} \boldsymbol{x}'_{1} \\ \vdots \\ \boldsymbol{x}'_{n} \end{pmatrix} = \begin{pmatrix} x_{11} & \dots & x_{1K} \\ \vdots & \ddots & \vdots \\ x_{n1} & \dots & x_{nK} \end{pmatrix}. \] Scalar quantities will be given in normal font \(x\), vector quantities in bold lowercase \(\boldsymbol{x}\), and all vectors will be presumed to be column vectors. Matrix quantities will be in bold uppercase \(\boldsymbol{X}\). The transpose of a matrix is denoted by either \(\boldsymbol{X}'\) or \(\boldsymbol{X}^T\).
Using the matrix notation, the assumptions for the causal inference are the following
Assumption 1.1 (linearity): \[\underset{(n \times 1)}{\boldsymbol{y}} = \underset{\underbrace{(n \times K)(K \times 1)}_{(n \times 1)}}{\boldsymbol{X} \: \: \: \boldsymbol{\beta}} + \underset{(n \times 1)}{\boldsymbol{\varepsilon}}.\]
Assumption 1.2 (strict exogeneity): \[\mathbb{E}(\varepsilon_{i}|\boldsymbol{X}) = 0 \: \: \: (i = 1, 2, \dots, n).\]
Assumption 1.3 (no multicollinearity): The rank of the \(n \times K\) data matrix, \(\boldsymbol{X}\), is \(K\) with probability 1.
Assumption 1.4 (spherical error variance): \[(\text{homoskedasticity}) \: \: \mathbb{E}(\varepsilon_{i}^2|\boldsymbol{X}) = \sigma^2 > 0 \: \: \: (i = 1, 2, \dots, n),\] \[(\text{no correlation between observations}) \: \: \: \mathbb{E}(\varepsilon_{i}\varepsilon_{j}|\boldsymbol{X}) = 0 \: \: \: (i, j = 1, 2, \dots, n; i \neq j).\]
Even though our regression diagnostic plots are promising, I would not make any kind of causal interpretations from our model.
Today is
## [1] "Mon Nov 16 16:15:17 2020"
This week we use the Boston data from the MASS package. The data have 14 variables and 506 observartions for each variable. The variables are either numerical or integers.
# Package
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data(Boston)
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Namely, the variables are
The correlation matrix can be illustrated by using corrplot package as follows
library(corrplot)
## corrplot 0.84 loaded
corr_matrix<-cor(Boston)
corrplot(corr_matrix)
That is, there is a a significant positive correlation between crim, rad, tax and lsat. On the other hand, the weighted mean of distances to five Boston employment centres, dis, has negative correlation with indus, nox, and age.
However, since the crime rate seems to be the variable in interest, let us illustrate it by some graphics.
require(ggplot2)
require(plotly)
## Loading required package: plotly
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(data = Boston, x = ~crim, type = "histogram")
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
plot_ly(data = Boston, x = ~rad, y = ~crim)
## No trace type specified:
## Based on info supplied, a 'scatter' trace seems appropriate.
## Read more about this trace type -> https://plot.ly/r/reference/#scatter
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
plot_ly(data = Boston, x = ~tax, y = ~crim)
## No trace type specified:
## Based on info supplied, a 'scatter' trace seems appropriate.
## Read more about this trace type -> https://plot.ly/r/reference/#scatter
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
plot_ly(data = Boston, x = ~lstat, y = ~crim)
## No trace type specified:
## Based on info supplied, a 'scatter' trace seems appropriate.
## Read more about this trace type -> https://plot.ly/r/reference/#scatter
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
Next we standardize the dataset and print out summaries of the scaled data.
scaled.Boston <- data.frame(scale(Boston))
Now all variables have mean of \(0\) and variance \(1\). For instance, now the plot of crim and lstat is the following.
plot_ly(data = scaled.Boston, x = ~lstat, y = ~crim)
## No trace type specified:
## Based on info supplied, a 'scatter' trace seems appropriate.
## Read more about this trace type -> https://plot.ly/r/reference/#scatter
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
Next we divide crim in 5 categories based on the 20/%-quantiles.
library(gtools)
q.crim <- quantcut(scaled.Boston$crim,q = 5)
summary(q.crim)
## [-0.419,-0.413] (-0.413,-0.403] (-0.403,-0.356] (-0.356,0.229] (0.229,9.92]
## 102 101 101 101 101
scaled.Boston$crim <- q.crim
We divide the dataset into train and test sets, so that 80/% of the data belongs to the train set.
sample <- sample.int(n = nrow(scaled.Boston), size = floor(.8*nrow(scaled.Boston)), replace = F)
train <- scaled.Boston[sample, ]
test <- scaled.Boston[-sample, ]
The linear discriminant analysis:
# MASS and train are available
# linear discriminant analysis
lda.fit <- lda(crim~., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crim ~ ., data = train)
##
## Prior probabilities of groups:
## [-0.419,-0.413] (-0.413,-0.403] (-0.403,-0.356] (-0.356,0.229] (0.229,9.92]
## 0.2103960 0.2004950 0.1856436 0.1955446 0.2079208
##
## Group means:
## zn indus chas nox rm
## [-0.419,-0.413] 1.21775653 -0.9627119 -0.04073494 -0.9360967 0.47451291
## (-0.413,-0.403] -0.08228945 -0.3604520 -0.07790436 -0.5828091 0.08707506
## (-0.403,-0.356] -0.30915715 -0.1593473 0.14762829 -0.2926223 -0.01118825
## (-0.356,0.229] -0.42211036 0.5824967 0.17620134 0.8473678 -0.16650436
## (0.229,9.92] -0.48724019 1.0149946 -0.13171834 1.0289602 -0.40874880
## age dis rad tax ptratio
## [-0.419,-0.413] -0.92544401 1.04418412 -0.70353683 -0.7335203 -0.4832093
## (-0.413,-0.403] -0.36844110 0.23833631 -0.60046672 -0.5685973 -0.1853223
## (-0.403,-0.356] 0.02495419 0.06662518 -0.48573345 -0.4356537 0.1030664
## (-0.356,0.229] 0.63935981 -0.54164278 0.09536178 0.2263908 -0.2443255
## (0.229,9.92] 0.84411276 -0.89802125 1.65960290 1.5294129 0.8057784
## black lstat medv
## [-0.419,-0.413] 0.3798922 -0.79190793 0.48801994
## (-0.413,-0.403] 0.3591934 -0.35283360 0.25362611
## (-0.403,-0.356] 0.2961959 0.05984219 0.06746983
## (-0.356,0.229] -0.1385576 0.10660048 0.05231187
## (0.229,9.92] -0.8321791 1.04493383 -0.88712672
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3 LD4
## zn -0.06064150 0.909102546 0.815167417 -0.30785188
## indus 0.07377266 -0.241636190 0.089557223 0.38544820
## chas -0.04372562 0.036973335 -0.009766943 -0.37059360
## nox 0.39875061 -0.842801427 1.345923605 -0.71798146
## rm 0.13448510 0.328819214 -0.258092350 -0.33956212
## age 0.15596111 -0.294379525 0.375879207 -0.39361747
## dis -0.12029184 -0.541535217 0.464810158 -0.71258238
## rad 1.74850208 1.000597656 -0.058187120 0.56260148
## tax 0.01150452 -0.172106779 -0.149102454 0.01705955
## ptratio -0.06689079 -0.038192933 -0.098551323 -0.82591418
## black -0.17864125 0.002611965 -0.112496628 -0.02563059
## lstat 0.27124585 -0.052097152 -0.937986408 -0.94884908
## medv -0.06649171 -0.858984349 -0.041792506 -0.42127718
##
## Proportion of trace:
## LD1 LD2 LD3 LD4
## 0.8429 0.1012 0.0497 0.0062
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crim)
# plot the lda results
plot(lda.fit, dimen = 2)
lda.arrows(lda.fit, myscale = 1)
Next we cross tabulate the results with the crime categories from the test set.
# lda.fit, correct_classes and test are available
correct_classes <- test$crim
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct [-0.419,-0.413] (-0.413,-0.403] (-0.403,-0.356]
## [-0.419,-0.413] 7 7 3
## (-0.413,-0.403] 13 7 0
## (-0.403,-0.356] 2 11 12
## (-0.356,0.229] 0 1 7
## (0.229,9.92] 0 0 0
## predicted
## correct (-0.356,0.229] (0.229,9.92]
## [-0.419,-0.413] 0 0
## (-0.413,-0.403] 0 0
## (-0.403,-0.356] 1 0
## (-0.356,0.229] 6 8
## (0.229,9.92] 2 15
Let us next reload the Boston dataset and standardize it. Then we calculate the distances between the observations.
# load MASS and Boston
library(MASS)
data('Boston')
# standardization
Boston <- scale(Boston)
# euclidean distance matrix
dist_eu <- dist(Boston)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(Boston, method = "manhattan")
# look at the summary of the distances
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
One way to determine the number of clusters is to look at how the total of within cluster sum of squares (WCSS) behaves when the number of cluster changes. The optimal number of clusters is when the value of total WCSS changes radically. In this case, two clusters would seem optimal.
# MASS, ggplot2 and Boston dataset are available
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
# k-means clustering
km <-kmeans(Boston, centers = 2)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)
(more chapters to be added similarly as we proceed with the course!)